Fault Tolerant Master-Worker over a Multi-Cluster Architecture
Abstract
As clusters grow into collections of clusters, the number of potential points of failure increases, which calls for a fault-tolerance scheme. A collection of heterogeneous networks of workstations (CoHNOW) is organized as a hierarchical master-worker scheme whose clusters may be geographically distributed and interconnected over the Internet. This paper describes Fault-Tolerant protection by Data Replication (FT-DR), a system that preserves critical functions through on-line dynamic data replication. The goal of the system model is to detect a failure in any functional element of the system and to tolerate it by restoring system consistency, so that the work in progress is guaranteed to complete (the recovery procedure). The model is designed to tolerate more than one simultaneous failure. Its fault-tolerance activities fall into three distinct phases: startup, normal execution (including failure-detection monitoring), and failure recovery. The system targets general master-worker applications running on a CoHNOW and is transparent to both the user and the application. The requirements that the master-worker environment must meet to support these capabilities, and the resulting runtime overhead, are under evaluation.
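The three phases named in the abstract can be illustrated with a minimal Python sketch. It is not the FT-DR implementation: the class `ReplicatedMaster`, the function `run`, and the in-memory `replica` dictionary are hypothetical names chosen for illustration, the failure is simulated rather than detected over a network, and a real system would copy the replicated state to another node.

```python
from collections import deque


class ReplicatedMaster:
    """Toy master that mirrors its task state to a replica on every change."""

    def __init__(self, tasks):
        self.pending = deque(tasks)   # tasks not yet handed out
        self.in_flight = {}           # worker name -> task it currently holds
        self.results = {}             # task -> result
        self.replica = {}             # stand-in for state copied to another node
        self._replicate()             # startup phase: create the first replica

    def _replicate(self):
        # Normal execution: copy the critical state after each update.
        self.replica = {
            "pending": list(self.pending),
            "in_flight": dict(self.in_flight),
            "results": dict(self.results),
        }

    def assign(self, worker):
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.in_flight[worker] = task
        self._replicate()
        return task

    def complete(self, worker, result):
        task = self.in_flight.pop(worker)
        self.results[task] = result
        self._replicate()

    def worker_failed(self, worker):
        # Recovery: requeue the lost task so another worker redoes it.
        task = self.in_flight.pop(worker, None)
        if task is not None:
            self.pending.appendleft(task)
        self._replicate()

    @classmethod
    def recover_from(cls, replica):
        # Recovery after a master failure: rebuild from the replicated state.
        master = cls([])
        master.results = dict(replica["results"])
        # Work that was in flight when the master failed is rescheduled.
        master.pending = deque(list(replica["in_flight"].values()) + replica["pending"])
        master._replicate()
        return master


def run(tasks, workers, failing_worker=None):
    """Square every task; `failing_worker` dies on its first assignment."""
    master = ReplicatedMaster(tasks)
    alive = list(workers)
    while master.pending or master.in_flight:
        if not alive:
            raise RuntimeError("no workers left to finish the job")
        for worker in list(alive):
            task = master.assign(worker)
            if task is None:
                continue
            if worker == failing_worker:
                alive.remove(worker)          # simulated failure detection
                master.worker_failed(worker)
                continue
            master.complete(worker, task * task)
    return master.results
```

Calling `run([1, 2, 3, 4], ["w1", "w2"], failing_worker="w2")` still produces a result for every task, because the task held by the failed worker is put back in the queue. `ReplicatedMaster.recover_from(master.replica)` shows the same idea for a failed master: the replacement starts from the last replicated state and reschedules whatever was in flight.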
Similar resources
A New Fault Tolerant Nonlinear Model Predictive Controller Incorporating an UKF-Based Centralized Measurement Fusion Scheme
A new Fault Tolerant Controller (FTC) has been presented in this research by integrating a Fault Detection and Diagnosis (FDD) mechanism in a nonlinear model predictive controller framework. The proposed FDD utilizes a Multi-Sensor Data Fusion (MSDF) methodology to enhance its reliability and estimation accuracy. An augmented state-vector model is developed to incorporate the occurred senso...
Distributed Parallel Processing Based on Master/Worker Model in Heterogeneous Computing Environment
Due to the complexity and varying requirements of the applications to utilize large-scale computing resources, there are several issues such as aggregating heterogeneous computing resources, easy-to-use programming model and fault-tolerant mechanism that need to be addressed. This paper presents a general distributed parallel processing architecture based on master/worker model, and it can aggre...
Parallelization of K-Means Clustering on Multi-Core Processors
Multi-core processors have recently been available on most personal computers. To get the maximum benefit of computational power from the multi-core architecture, we need a new design on existing algorithms and software. In this paper we propose the parallelization of the well-known k-means clustering algorithm. We employ a single program multiple data (SPMD) approach based on a message passing...
Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations
A Cluster of Workstations (COW) is a network-based multi-computer system, which is the most prominent distributed memory system aimed to replace supercomputers. A cluster of workstations can be viewed as a single machine in which one job is divided into n subtasks and delegated to n workstations in the COW architecture. To get the job completed, all subtasks assigned to component workstations mus...
Scalable And Fault Tolerant Hierarchical B&B Algorithms For Computational Grids
Solving to optimality large instances of combinatorial optimization problems using Branch and Bound (B&B) algorithms requires a huge amount of computing resources. Nowadays, such power is provided by large scale environments such as computational grids. However, grids induce new challenges: scalability, heterogeneity, and fault tolerance. Most existing grid-based B&Bs are developed using the ...